AITopics

2502.15072

Country:

South America > Brazil (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)
North America > United States > Kentucky (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Education (1.00)
Banking & Finance (1.00)
Transportation (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

arXiv.org Machine Learning, Feb-3-2025

Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data

Lin, Huawei, Chung, Jun Woo, Lao, Yingjie, Zhao, Weijie

Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. However, in the traditional setting, all data must be accessible simultaneously during training: data instances cannot be added or deleted after training. In this paper, we propose an efficient online learning framework for GBDT supporting both incremental and decremental learning. To the best of our knowledge, this is the first work that considers in-place unified incremental and decremental learning on GBDT. To reduce the learning cost, we present a collection of optimizations for our framework, so that it can add or delete a small fraction of data on the fly. We theoretically show the relationship between the hyper-parameters of the proposed optimizations, which enables trading off accuracy and cost in incremental and decremental learning. The backdoor attack results show that our framework can successfully inject and remove a backdoor in a well-trained model using incremental and decremental learning, and the empirical results on public datasets confirm the effectiveness and efficiency of our proposed online learning framework and optimizations.

artificial intelligence, decision tree learning, machine learning, (19 more...)

2502.01634

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.28)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Santa Clara County > San Jose (0.04)
(22 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, Oct-2-2024

DynFrs: An Efficient Framework for Machine Unlearning in Random Forest

Wang, Shurong, Shen, Zhuoyang, Qiao, Xinbao, Zhang, Tongning, Zhang, Meng

Random Forests are widely recognized for their established efficacy in classification and regression tasks, standing out in various domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, as personal and confidential data are involved. With increasing demand for the right to be forgotten, particularly under regulations such as GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, insufficient attention has been paid to this topic, and existing approaches are difficult to apply in real-world scenarios. Addressing this gap, we propose the DynFrs framework, designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. DynFrs leverages the subsampling method Occ(q) and the lazy tag strategy Lzy, and remains adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs only in a proportion of trees so that the impact of deleting samples is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications to tree structures. In experiments, applying DynFrs to Extremely Randomized Trees yields substantial improvements, achieving orders of magnitude faster unlearning and better predictive accuracy than existing machine unlearning methods for Random Forests.
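The abstract only states the property that Occ(q) provides: each training sample occurs in a proportion of the trees, so deleting a sample touches few trees. The sketch below illustrates that property and is not the paper's construction. The function names, the choice of exactly ceil(q * n_trees) trees per sample, and the uniform random assignment are all assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Set


def assign_samples(n_samples: int, n_trees: int, q: float, seed: int = 0) -> Dict[int, Set[int]]:
    """Map each tree index to the set of sample indices it trains on.

    Assumption: every sample is placed in exactly ceil(q * n_trees)
    distinct trees chosen uniformly at random.
    """
    rng = random.Random(seed)
    k = math.ceil(q * n_trees)
    tree_to_samples: Dict[int, Set[int]] = {t: set() for t in range(n_trees)}
    for i in range(n_samples):
        for t in rng.sample(range(n_trees), k):
            tree_to_samples[t].add(i)
    return tree_to_samples


def trees_affected_by_deletion(tree_to_samples: Dict[int, Set[int]], sample: int) -> List[int]:
    """Return the trees that would need updating if `sample` were deleted."""
    return [t for t, members in tree_to_samples.items() if sample in members]


assignment = assign_samples(n_samples=100, n_trees=8, q=0.25)
affected = trees_affected_by_deletion(assignment, sample=7)
print(f"deleting sample 7 touches {len(affected)} of 8 trees")
```

With q = 0.25 and 8 trees, a deletion request reaches only two trees, and the other six stay untouched.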

conference paper, random forest, time complexity, (14 more...)

2410.01588

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Singapore (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Immunology (0.94)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

arXiv.org Machine Learning, Oct-1-2024

Timber! Poisoning Decision Trees

Calzavara, Stefano, Cazzaro, Lorenzo, Vettori, Massimo

We present Timber, the first white-box poisoning attack targeting decision trees. Timber is based on a greedy attack strategy leveraging sub-tree retraining to efficiently estimate the damage performed by poisoning a given training instance. The attack relies on a tree annotation procedure which enables sorting training instances so that they are processed in increasing order of computational cost of sub-tree retraining. This sorting yields a variant of Timber supporting an early stopping criterion designed to make poisoning attacks more efficient and feasible on larger datasets. We also discuss an extension of Timber to traditional random forest models, which is useful because decision trees are normally combined into ensembles to improve their predictive power. Our experimental evaluation on public datasets shows that our attacks outperform existing baselines in terms of effectiveness, efficiency or both. Moreover, we show that two representative defenses can mitigate the effect of our attacks, but fail at effectively thwarting them.

dataset, decision tree, poisoning attack, (13 more...)

2410.00862

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
Europe > Italy > Veneto > Venice (0.04)
(22 more...)

Genre: Research Report (0.50)

Industry:

Materials > Paper & Forest Products (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, May-18-2023

Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Zhang, Zheyu, Zhang, Tianping, Li, Jian

Gradient Boosting Decision Tree (GBDT) has achieved remarkable success in a wide variety of applications. The split finding algorithm, which determines the tree construction process, is one of the most crucial components of GBDT. However, the split finding algorithm has long been criticized for its bias towards features with a large number of potential splits. This bias introduces severe interpretability and overfitting issues in GBDT. To this end, we provide a fine-grained analysis of bias in GBDT and demonstrate that the bias originates from 1) the systematic bias in the gain estimation of each split and 2) the bias in the split finding algorithm resulting from the use of the same data to evaluate the split improvement and determine the best split. Based on the analysis, we propose unbiased gain, a new unbiased measurement of gain importance using out-of-bag samples. Moreover, we incorporate the unbiased property into the split finding algorithm and develop UnbiasedGBM to solve the overfitting issue of GBDT. We assess the performance of UnbiasedGBM and unbiased gain in a large-scale empirical study comprising 60 datasets and show that: 1) UnbiasedGBM exhibits better average performance on the 60 datasets than popular GBDT implementations such as LightGBM, XGBoost, and CatBoost, and 2) unbiased gain achieves better average performance in feature selection than popular feature importance methods. The code is available at https://github.com/ZheyuAqaZhang/UnbiasedGBM.

artificial intelligence, decision tree learning, machine learning, (17 more...)

2305.10696

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
Europe > France > Normandy > Seine-Maritime > Rouen (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

#artificialintelligence, Feb-7-2022, 12:40:27 GMT

Cheat-Sheet: Decision Trees Terminology

Now that we know the basic building blocks of a decision tree, we need to know how to grow one. Growing a decision tree means dividing the input space into several distinct, non-overlapping sub-spaces. To divide the input space, we have to test all features and threshold values to find the optimal split that minimizes our cost function. Once we obtain the best split, we can continue to grow our tree recursively. The process is termed recursive since each sub-space may be split an indefinite number of times until a stopping criterion (e.g.

best split, cheat-sheet, decision tree terminology, (4 more...)
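The split search described in the excerpt above can be sketched as an exhaustive loop over features and candidate thresholds. This is a minimal illustration, not the cheat-sheet author's code. It assumes a regression target, a sum-of-squared-errors cost, and midpoints between sorted unique values as candidate thresholds; `sse` and `best_split` are hypothetical names.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X: List[List[float]], y: List[float]) -> Optional[Tuple[int, float, float]]:
    """Test every feature and every midpoint threshold.

    Returns (feature_index, threshold, cost) for the lowest-cost split,
    or None when no feature has two distinct values.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, threshold, cost)
    return best


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [10.0, 10.0, 20.0, 20.0]
split = best_split(X, y)
print(split)  # (0, 2.5, 0.0): feature 0 at threshold 2.5 separates the targets exactly
```

Growing the tree recursively would then mean calling `best_split` again on the rows falling on each side of the chosen threshold until a stopping criterion is met.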

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.93)

#artificialintelligence, Mar-9-2020, 06:24:47 GMT

Decision Trees Explained

In this post, I will explain Decision Trees in simple terms. It could be considered a "Decision Trees for dummies" post; however, I've never really liked that expression. In the Machine Learning world, Decision Trees are a kind of non-parametric model that can be used for both classification and regression. This means that Decision Trees are flexible models that don't increase their number of parameters as we add more features (if we build them correctly), and they can output either a categorical prediction (like whether a plant is of a certain kind or not) or a numerical prediction (like the price of a house). They are constructed using two kinds of elements: nodes and branches.

decision tree, leave node, node, (13 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

arXiv.org Machine Learning, Jul-3-2019

An Experimental Evaluation of Large Scale GBDT Systems

Fu, Fangcheng, Jiang, Jiawei, Ying, Shaoxia, Cui, Bin

Gradient boosting decision tree (GBDT) is a widely used machine learning algorithm in both data analytics competitions and real-world industrial applications. Further, driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. However, we find it surprising that the existing systems manage the training dataset in different ways, but none of them have studied the impact of data management. To that end, this paper aims to study the pros and cons of different data management methods regarding the performance of distributed GBDT. We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. Then we conduct an in-depth systematic analysis and summarize the advantageous scenarios of the quadrants. Based on the analysis, we further propose a novel distributed GBDT system named Vero, which adopts the unexplored combination of vertical partitioning and row-store and suits many large-scale cases. To validate our analysis empirically, we implement different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline on choosing a proper data management policy for a given workload.

artificial intelligence, information management, machine learning, (19 more...)

doi: 10.14778/3342263.3342273

1907.01882

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

#artificialintelligence, Oct-31-2018, 23:27:27 GMT

Machine Learning Basics - Random Forest

Random Forest (RF) is based on decision trees. In machine learning, decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of "if… then…" decision splits, similar to the branches of a tree. If we imagine that we start with a sample for which we want to predict a class, we would start at the bottom of a tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning, let's say "age"; we would now make a decision about which branch to follow: "if our sample has an age greater than 30, continue along the left branch, else continue along the right branch".
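The "if… then…" traversal described above can be sketched in a few lines. This is an illustrative sketch only: the `Node` class, the class labels "A" and "B", and the single-split tree are hypothetical; the left-when-greater rule follows the "age" example in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    feature: Optional[str] = None      # feature tested at this split
    threshold: Optional[float] = None  # split value
    left: Optional["Node"] = None      # followed when value > threshold
    right: Optional["Node"] = None     # followed otherwise
    label: Optional[str] = None        # set only on leaves


def predict(node: Node, sample: Dict[str, float]) -> str:
    """Follow decision splits from the root until a leaf is reached."""
    while node.label is None:
        if sample[node.feature] > node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hypothetical one-split tree: age > 30 goes left, everything else goes right.
tree = Node(feature="age", threshold=30, left=Node(label="A"), right=Node(label="B"))

print(predict(tree, {"age": 42}))  # A
print(predict(tree, {"age": 25}))  # B
```

A random forest would hold many such trees and combine their individual predictions, for example by majority vote.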

artificial intelligence, decision tree learning, machine learning, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

#artificialintelligence, Aug-13-2018, 08:28:42 GMT

Learn ML Algorithms by coding: Decision Trees – Lethal Brains

Let us build a crude decision tree which predicts the outcome as probabilities (in scikit-learn, the predict method returns the predicted classes while the predict_proba method returns the predicted probabilities). What do you think would be the simplest way to predict the probabilities? I have touched it up a little bit. The fit method accepts a dataframe (data) and a string for the target attribute (target). Both of them are then assigned to the object.
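One possible answer to the question above is to report class frequencies in the training data as probabilities. The sketch below is not the post's code. It assumes a list of dicts as a stand-in for the dataframe, and the class name `CrudeLeaf` and the example rows are hypothetical; only the `fit(data, target)` shape and the predict / predict_proba distinction come from the text.

```python
from collections import Counter
from typing import Dict, List


class CrudeLeaf:
    """A single-node 'tree' that predicts class frequencies as probabilities."""

    def fit(self, data: List[dict], target: str) -> "CrudeLeaf":
        # As in the post, both arguments are simply stored on the object.
        self.data = data
        self.target = target
        return self

    def predict_proba(self) -> Dict[str, float]:
        """Relative frequency of each target value in the stored data."""
        counts = Counter(row[self.target] for row in self.data)
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}

    def predict(self) -> str:
        """Most frequent class, mirroring the predict / predict_proba split."""
        proba = self.predict_proba()
        return max(proba, key=proba.get)


rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
]
leaf = CrudeLeaf().fit(rows, "play")
print(leaf.predict_proba())  # {'no': 0.25, 'yes': 0.75}
print(leaf.predict())        # yes
```

A full tree would apply the same frequency count at each leaf, using only the rows that reach that leaf.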

artificial intelligence, decision tree learning, machine learning, (15 more...)

Genre: Personal > Interview (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.69)